Abstract

Inspired by recent advances in multimodal learning and machine translation, we introduce an
encoder-decoder pipeline that learns (a) a multimodal joint embedding space with images and text
and (b) a novel language model for decoding distributed representations from our space. Our
pipeline effectively unifies joint image-text embedding models with multimodal neural language
models. We introduce the structure-content neural language model that disentangles the structure
of a sentence from its content, conditioned on representations produced by the encoder. The
encoder allows one to rank images and sentences while the decoder can generate novel descriptions
from scratch. Using LSTM to encode sentences, we match the state-of-the-art performance on
Flickr8K and Flickr30K without using object detections. We also set new best results when using
the 19-layer Oxford convolutional network. Furthermore, we show that with linear encoders, the
learned embedding space captures multimodal regularities in terms of vector space arithmetic,
e.g. *image of a blue car* - "blue" + "red" is near images of red cars. Sample captions generated
for 800 images are made available for comparison.



1 Introduction 

Generating descriptions for images has long been regarded as a challenging perception task
integrating vision, learning and language understanding. One not only needs to correctly recognize
what appears in images but also incorporate knowledge of spatial relationships and interactions
between objects. Even with this information, one then needs to generate a description that is
relevant and grammatically correct. With the recent advances made in deep neural networks, tasks
such as object recognition and detection have made significant breakthroughs in only a short time.
The task of describing images is one that now appears tractable and ripe for advancement. Being
able to annotate large image databases with accurate descriptions for each image would
significantly improve the capabilities of content-based image retrieval systems. Moreover, systems
that can describe images well could, in principle, also be fine-tuned to answer questions about
images.

This paper describes a new approach to the problem of image caption generation, cast into the
framework of encoder-decoder models. For the encoder, we learn a joint image-sentence embedding
where sentences are encoded using long short-term memory (LSTM) recurrent neural networks [1].
Image features from a deep convolutional network are projected into the embedding space of the
LSTM hidden states. A pairwise ranking loss is minimized in order to learn to rank images and their
descriptions. For decoding, we introduce a new neural language model called the structure-content
neural language model (SC-NLM). The SC-NLM differs from existing models in that it disentangles
the structure of a sentence from its content, conditioned on distributed representations produced
by the encoder. We show that sampling from an SC-NLM allows us to generate realistic image
captions, significantly improving over the generated captions produced by [2]. Furthermore, we
argue that this combination of approaches naturally fits into the experimentation framework of
[3], that is, a good encoder can be used to rank images and captions while a good decoder can be
used to generate new captions from scratch. Our approach effectively unifies image-text embedding
models (encoder phase) [4, 5, 6] with multimodal neural language models (decoder phase) [2, 7].
Furthermore, our method builds on analogous approaches being used in machine translation
[8, 9, 10, 11].

[Figure 1 images omitted. Generated captions shown with the images:]
  there is a cat sitting on a shelf .
  a plate with a fork and a piece of cake .
  a black and white photo of a window .
  a young boy standing on a parking lot next to cars .
  a wooden table and chairs arranged in a room .
  a kitchen with stainless steel appliances .
  this is a herd of cattle out in the field .
  a car is parked in the middle of nowhere .
  a ferry boat on a marina with a group of people .
  a little boy with a bunch of friends on the street .
[Bottom row, error cases:]
  a giraffe is standing next to a fence in a field . (hallucination)
  the two birds are trying to be seen in the water . (counting)
  a parked car while driving down the road . (contradiction)
  the handlebars are trying to ride a bike rack . (nonsensical)
  a woman and a bottle of wine in a garden . (gender)

Figure 1: Sample generated captions. The bottom row shows different error cases. Additional results
can be found at http://www.cs.toronto.edu/~rkiros/lstm_scnlm.html

While the application focus of our work is on image description generation and ranking, we also
qualitatively analyse properties of multimodal vector spaces learned using images and sentences. We
show that using a linear sentence encoder, linguistic regularities [12] also carry over to
multimodal vector spaces. For example, *image of a blue car* - "blue" + "red" results in a vector
that is near images of red cars. We qualitatively examine several types of analogies and structures
with PCA projections. Consequently, even with a global image-sentence training objective the
encoder can still be used to retrieve locally (e.g. individual words). This is analogous to
pairwise ranking methods used in machine translation [13, 14].



1.1 Multimodal representation learning 

A large body of work has been done on learning multimodal representations of images and text.
Popular approaches include learning joint image-word embeddings [4, 5] as well as embedding
images and sentences into a common space [6, 15]. Our proposed pipeline makes direct use of
these ideas. Other approaches to multimodal learning include the use of deep Boltzmann machines
[16], log-bilinear neural language models [2], autoencoders [17], recurrent neural networks [3]
and topic-models [18]. Several bi-directional approaches to ranking images and captions have also
been proposed, based on kernel CCA [3], normalized CCA [19] and dependency tree recursive
networks [6]. From an architectural standpoint, our encoder-decoder model is most similar to [20],
who proposed a two-step embedding and generation procedure for semantic parsing.



1.2 Generating descriptions of images 

We group together approaches to generation into three types of methods, each described here in
more detail:

Template-based methods. Template-based methods involve filling in sentence templates, such as
triplets, based on the results of object detections and spatial relationships [21, 22, 23, 24, 25].
While these approaches can produce accurate descriptions, they are often more 'robotic' in nature
and do not generalize to the fluidity and naturalness of captions written by humans.

[Figure 2 diagram omitted; panel label: CNN-LSTM Encoder.]

Figure 2: Encoder: A deep convolutional network (CNN) and long short-term memory recurrent
network (LSTM) for learning a joint image-sentence embedding. Decoder: A new neural language
model that combines structure and content vectors for generating words one at a time in sequence.

Composition-based methods. These approaches aim to harness existing image-caption databases
by extracting components of related captions and composing them together to generate novel
descriptions [26, 27]. The advantage of these approaches is that they allow for a much broader
and more expressive class of captions that are more fluent and human-like than those of
template-based approaches.

Neural network methods. These approaches aim to generate descriptions by sampling from
conditional neural language models. The initial work in this area, based on multimodal neural
language models [2], generated captions by conditioning on feature vectors from the output of a
deep convolutional network. These ideas were recently extended to multimodal recurrent networks
with significant improvements [7]. The methods described in this paper produce descriptions that
are at least qualitatively on par with current state-of-the-art composition-based methods [27].

Description generation systems have been plagued with issues of evaluation. While Bleu and Rouge
have been used in the past, [3] argued that such automated evaluation methods are unreliable
and do not match human judgements. These authors instead proposed that the problem of ranking
images and captions can be used as a proxy for generation. Since any generation system requires a
scoring function to assess how well a caption and image match, optimizing this task should
naturally carry over to an improvement in generation. Many recent methods have since used this
approach for evaluation. Nonetheless, the question of how to transfer improvements on ranking to
generating new descriptions remained. We argue that encoder-decoder methods naturally fit into
this experimentation framework. That is, the encoder gives us a way to rank images and captions and
develop good scoring functions, while the decoder can use the representations learned to optimize
the scoring functions as a way of generating and scoring new descriptions.

1.3 Encoder-decoder methods for machine translation 

Our proposed pipeline, while new to caption generation, has already experienced several successes in
Neural Machine Translation (NMT). The goal of NMT is to develop an end-to-end translation system
with a large neural network, as opposed to using a neural network as an additional feature function
to an existing phrase-based system. NMT methods are based on the encoder-decoder principle.
That is, an encoder is used to map an English sentence to a distributed vector. A decoder is then
conditioned on this vector to generate a French translation from the source text. Current methods
include using a convolutional encoder and RNN decoder [8], RNN encoder and RNN decoder [9, 10]
and LSTM encoder with LSTM decoder [11]. While still a young research area, these methods have
already achieved performance on par with strong phrase-based systems and have improved on the
state-of-the-art when used for rescoring.

We argue that it is natural to think of image caption generation as a translation problem. That is,
our goal is to translate an image into a description. This point of view has also been used by [28]
and allows us to make use of existing ideas in the machine translation literature. Furthermore,
there is a natural correspondence between the concept of scoring functions (how well do a caption
and image match) and alignments (which parts of a description correspond to which parts of an
image) that can naturally be exploited for generating descriptions.

2 An encoder-decoder model for ranking and generation



In this section we describe our image caption generation pipeline. We first review LSTM RNNs
which are used for encoding sentences, followed by how to learn multimodal distributed
representations. We then review log-bilinear neural language models [29], multiplicative neural
language models [30] and then introduce our structure-content neural language model.

2.1 Long short-term memory RNNs 

Long short-term memory [1] is a recurrent neural network that incorporates a built-in memory cell
to store information and exploit long range context. LSTM memory cells are surrounded by gating
units for the purpose of reading, writing and resetting information. LSTMs have been used to
achieve state-of-the-art performance in several tasks such as handwriting recognition [31],
sequence generation [32], speech recognition [33] and machine translation [11], among others.
Dropout [34] strategies have also been proposed to prevent overfitting in deep LSTMs [35].

Let $X_t$ denote a matrix of training instances at time $t$. In our case, $X_t$ is used to denote a
matrix of word representations for the $t$-th word of each sentence in the training batch. Let
$(I_t, F_t, C_t, O_t, M_t)$ denote the input, forget, cell, output and hidden states of the LSTM at
time step $t$. The LSTM architecture in this work is implemented using equations (1)-(5), where
$\sigma$ denotes the sigmoid activation function, $(\cdot)$ indicates matrix multiplication and
$(\bullet)$ indicates component-wise multiplication.
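The LSTM update in equations (1)-(5) maps directly onto array operations. The following is a minimal NumPy sketch, not the authors' implementation: the function names, the parameter dictionary layout, the zero biases and the toy dimensions are all assumptions made for illustration. The uniform initialization range is the one reported in Section 3.1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(input_dim, hidden_dim, rng, scale=0.08):
    """Weights drawn uniformly from [-scale, scale]; zero biases are an assumption."""
    p = {}
    for g in "ifco":  # input gate, forget gate, cell, output gate
        p["W_x" + g] = rng.uniform(-scale, scale, (input_dim, hidden_dim))
        p["W_h" + g] = rng.uniform(-scale, scale, (hidden_dim, hidden_dim))
        p["b_" + g] = np.zeros(hidden_dim)
    for g in "ifo":  # connections from the cell state into the gates
        p["W_c" + g] = rng.uniform(-scale, scale, (hidden_dim, hidden_dim))
    return p

def lstm_step(X_t, M_prev, C_prev, p):
    """One time step; '@' is matrix multiplication, '*' is component-wise."""
    I_t = sigmoid(X_t @ p["W_xi"] + M_prev @ p["W_hi"] + C_prev @ p["W_ci"] + p["b_i"])  # (1)
    F_t = sigmoid(X_t @ p["W_xf"] + M_prev @ p["W_hf"] + C_prev @ p["W_cf"] + p["b_f"])  # (2)
    C_t = F_t * C_prev + I_t * np.tanh(X_t @ p["W_xc"] + M_prev @ p["W_hc"] + p["b_c"])  # (3)
    O_t = sigmoid(X_t @ p["W_xo"] + M_prev @ p["W_ho"] + C_t @ p["W_co"] + p["b_o"])     # (4)
    M_t = O_t * np.tanh(C_t)                                                             # (5)
    return M_t, C_t

def encode(X_seq, p, hidden_dim):
    """X_seq has shape (T, batch, input_dim); returns the last hidden state."""
    batch = X_seq.shape[1]
    M = np.zeros((batch, hidden_dim))
    C = np.zeros((batch, hidden_dim))
    for X_t in X_seq:
        M, C = lstm_step(X_t, M, C, p)
    return M

rng = np.random.default_rng(0)
params = init_params(input_dim=8, hidden_dim=8, rng=rng)
sentence_batch = rng.standard_normal((5, 4, 8))  # 5 time steps, 4 sentences, 8-dim word vectors
v = encode(sentence_batch, params, hidden_dim=8)  # one sentence vector per row
```

The final hidden state returned by `encode` plays the role of the sentence representation used by the encoder; a real batch would also need to handle sentences of different lengths, which this sketch ignores.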

2.2 Multimodal distributed representations 

Suppose for training we are given image-description pairs each corresponding to an image and a
description that correctly describes the image. Images are represented as the top layer (before the
softmax) of a convolutional network trained on the ImageNet classification task [36].

Let $D$ be the dimensionality of an image feature vector (e.g. 4096 for AlexNet [36]), $K$ the
dimensionality of the embedding space and let $V$ be the number of words in the vocabulary. Let
$W_I \in \mathbb{R}^{K \times D}$ and $W_T \in \mathbb{R}^{K \times V}$ be the image embedding
matrix and word embedding matrix, respectively. Given an image description
$S = \{w_1, \ldots, w_N\}$ with words $w_1, \ldots, w_N$,^2 let
$\{\mathbf{w}_1, \ldots, \mathbf{w}_N\}$, $\mathbf{w}_i \in \mathbb{R}^K$, $i = 1, \ldots, N$
denote the corresponding word representations to words $w_1, \ldots, w_N$ (entries in the matrix
$W_T$). The representation of a sentence $\mathbf{v}$ is the hidden state of the LSTM at time step
$N$ (i.e. the vector $\mathbf{m}_N$). We note that other approaches for computing sentence
representations for image-text embeddings have been proposed, including dependency tree RNNs [6]
and bags of dependency parses [15]. Let $\mathbf{q} \in \mathbb{R}^D$ denote an image feature
vector (for the image corresponding to description $S$) and let
$\mathbf{x} = W_I \cdot \mathbf{q} \in \mathbb{R}^K$ be the image embedding. We define a scoring
function $s(\mathbf{x}, \mathbf{v}) = \mathbf{x} \cdot \mathbf{v}$, where $\mathbf{x}$ and
$\mathbf{v}$ are first scaled to have unit norm (making $s$ equivalent to cosine similarity). Let
$\theta$ denote all the parameters to be learned ($W_I$ and all the LSTM weights).^3 We optimize
the following pairwise ranking loss:

\begin{equation}
\min_{\theta} \sum_{\mathbf{x}} \sum_{k} \max\{0, \alpha - s(\mathbf{x}, \mathbf{v}) + s(\mathbf{x}, \mathbf{v}_k)\}
+ \sum_{\mathbf{v}} \sum_{k} \max\{0, \alpha - s(\mathbf{v}, \mathbf{x}) + s(\mathbf{v}, \mathbf{x}_k)\} \tag{6}
\end{equation}

where $\mathbf{v}_k$ is a contrastive (non-descriptive) sentence for image embedding $\mathbf{x}$,
and vice-versa with $\mathbf{x}_k$. For all of our experiments, we initialize the word embeddings
$W_T$ to be pre-computed $K = 300$ dimensional vectors learned using a continuous bag-of-words
model [37]. The contrastive terms are chosen randomly from the training set and resampled every
epoch.
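The pairwise ranking loss (6) can be sketched in a few lines of NumPy. This is an illustrative sketch under one simplifying assumption that differs from the text: every other item in the minibatch is used as a contrastive example, rather than contrastive terms sampled from the whole training set. The function names are hypothetical; the margin default of 0.2 is the value reported in Section 3.1.

```python
import numpy as np

def l2_normalize(a):
    """Scale each row to unit norm so that dot products are cosine similarities."""
    return a / np.linalg.norm(a, axis=1, keepdims=True)

def pairwise_ranking_loss(x, v, alpha=0.2):
    """x: (B, K) image embeddings, v: (B, K) sentence embeddings; row i of x matches row i of v."""
    x, v = l2_normalize(x), l2_normalize(v)
    scores = x @ v.T                       # scores[i, j] = s(x_i, v_j)
    pos = np.diag(scores)                  # s(x_i, v_i) for the matching pairs
    # Image i against contrastive sentences v_j (first sum of the loss).
    cost_s = np.maximum(0.0, alpha - pos[:, None] + scores)
    # Sentence j against contrastive images x_i (second sum of the loss).
    cost_x = np.maximum(0.0, alpha - pos[None, :] + scores)
    # The matching pair itself is not a contrastive example.
    np.fill_diagonal(cost_s, 0.0)
    np.fill_diagonal(cost_x, 0.0)
    return cost_s.sum() + cost_x.sum()

aligned = pairwise_ranking_loss(np.eye(3), np.eye(3))                       # perfectly matched pairs
shuffled = pairwise_ranking_loss(np.eye(3), np.roll(np.eye(3), 1, axis=0))  # every pair mismatched
```

The loss is zero once every matching pair beats every contrastive pair by at least the margin, which is why the aligned toy batch incurs no cost while the shuffled one does.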

^1 For additional details on LSTM: http://people.idsia.ch/~juergen/rnn.html

^2 As a slight abuse of notation, we refer to $w_i$ as both a word and an index into the word embedding matrix.

^3 We keep the word embedding matrix $W_T$ fixed.



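LSTM equations (Section 2.1):

\begin{align}
I_t &= \sigma(X_t \cdot W_{xi} + M_{t-1} \cdot W_{hi} + C_{t-1} \cdot W_{ci} + b_i) \tag{1} \\
F_t &= \sigma(X_t \cdot W_{xf} + M_{t-1} \cdot W_{hf} + C_{t-1} \cdot W_{cf} + b_f) \tag{2} \\
C_t &= F_t \bullet C_{t-1} + I_t \bullet \tanh(X_t \cdot W_{xc} + M_{t-1} \cdot W_{hc} + b_c) \tag{3} \\
O_t &= \sigma(X_t \cdot W_{xo} + M_{t-1} \cdot W_{ho} + C_t \cdot W_{co} + b_o) \tag{4} \\
M_t &= O_t \bullet \tanh(C_t) \tag{5}
\end{align}

2.3 Log-bilinear neural language models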

The log-bilinear language model (LBL) [29] is a deterministic model that may be viewed as a
feed-forward neural network with a single linear hidden layer. Each word $w$ in the vocabulary is
represented as a $K$-dimensional real-valued vector $\mathbf{w} \in \mathbb{R}^K$, as in the case
of the encoder. Let $R$ denote the $V \times K$ matrix of word representation vectors,^4 where $V$
is the vocabulary size. Let $(w_1, \ldots, w_{n-1})$ be a tuple of $n - 1$ words where $n - 1$ is
the context size. The LBL model makes a linear prediction of the next word representation as

\begin{equation}
\hat{\mathbf{r}} = \sum_{i=1}^{n-1} C^{(i)} \mathbf{w}_i, \tag{7}
\end{equation}

where $C^{(i)}$, $i = 1, \ldots, n - 1$ are $K \times K$ context parameter matrices. Thus,
$\hat{\mathbf{r}}$ is the predicted representation of $w_n$. The conditional probability
$P(w_n = i \mid w_{1:n-1})$ of $w_n$ given $w_1, \ldots, w_{n-1}$ is

\begin{equation}
P(w_n = i \mid w_{1:n-1}) = \frac{\exp(\hat{\mathbf{r}}^T \mathbf{r}_i + b_i)}{\sum_{j=1}^{V} \exp(\hat{\mathbf{r}}^T \mathbf{r}_j + b_j)}, \tag{8}
\end{equation}

where $\mathbf{b} \in \mathbb{R}^V$ is a bias vector. Learning is done with stochastic gradient
descent.
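As a concrete illustration of the log-bilinear prediction, the sketch below computes the predicted representation from a context of word indices and then the softmax over the vocabulary. It is a toy example with randomly initialized parameters; the function name, shapes and sizes are assumptions for illustration only.

```python
import numpy as np

def lbl_next_word_probs(context_ids, R, C, b):
    """context_ids: n-1 word indices; R: (V, K) word vectors; C: (n-1, K, K); b: (V,)."""
    # Linear prediction of the next word representation from the context words.
    r_hat = sum(C[i] @ R[w] for i, w in enumerate(context_ids))
    # Score every word in the vocabulary against the prediction, then apply a softmax.
    logits = R @ r_hat + b
    logits = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

rng = np.random.default_rng(0)
V, K, context_size = 20, 5, 3
R = rng.standard_normal((V, K))
C = rng.standard_normal((context_size, K, K))
b = np.zeros(V)
probs = lbl_next_word_probs([3, 7, 1], R, C, b)  # distribution over the 20 toy words
```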

2.4 Multiplicative neural language models 

Suppose now we are given a vector $\mathbf{u} \in \mathbb{R}^K$ from the multimodal vector space,
which has an association with a word sequence $S = \{w_1, \ldots, w_N\}$. For example,
$\mathbf{u}$ may be the embedded representation of an image whose description is given by $S$. A
multiplicative neural language model [30] models the distribution
$P(w_n = i \mid w_{1:n-1}, \mathbf{u})$ of a new word $w_n$ given context from the previous words
and the vector $\mathbf{u}$. A multiplicative model has the additional property that the word
embedding matrix is instead replaced with a tensor $T \in \mathbb{R}^{V \times K \times G}$ where
$G$ is the number of slices. Given $\mathbf{u}$, we can compute a word representation matrix as a
function of $\mathbf{u}$ as $T^{u} = \sum_{i=1}^{G} u_i T^{(i)}$, i.e. word representations with
respect to $\mathbf{u}$ are computed as a linear combination of slices weighted by each component
$u_i$ of $\mathbf{u}$. Here, the number of slices $G$ is equal to $K$, the dimensionality of
$\mathbf{u}$.

It is often unnecessary to use a fully unfactored tensor. As in e.g. [38, 39], we re-represent $T$
in terms of three matrices $W^{fk} \in \mathbb{R}^{F \times K}$,
$W^{fd} \in \mathbb{R}^{F \times G}$ and $W^{fv} \in \mathbb{R}^{F \times V}$, such that

\begin{equation}
T^{u} = (W^{fv})^T \cdot \mathrm{diag}(W^{fd} \mathbf{u}) \cdot W^{fk} \tag{9}
\end{equation}

where $\mathrm{diag}(\cdot)$ denotes the matrix with its argument on the diagonal. These matrices
are parametrized by a pre-chosen number of factors $F$. In [30], the conditioning vector
$\mathbf{u}$ is referred to as an attribute and using a third-order model of words allows one to
model conditional similarity: how meanings of words change as a function of the attributes they're
conditioned on.
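To see why the factorization is a special case of the slice-weighted tensor, note that $\mathrm{diag}(W^{fd}\mathbf{u}) = \sum_i u_i \, \mathrm{diag}(W^{fd}(:, i))$, so each slice is implicitly $(W^{fv})^T \mathrm{diag}(W^{fd}(:, i)) W^{fk}$. The following NumPy sketch checks this numerically on toy sizes; all dimensions and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, G, F = 50, 6, 6, 4  # toy vocabulary size, embedding size, slices, factors
W_fk = rng.standard_normal((F, K))
W_fd = rng.standard_normal((F, G))
W_fv = rng.standard_normal((F, V))
u = rng.standard_normal(G)

# Factored form: a (V, K) word representation matrix conditioned on u.
T_u_factored = W_fv.T @ np.diag(W_fd @ u) @ W_fk

# Explicit slices implied by the factorization, stacked into a (G, V, K) array.
slices = np.stack([W_fv.T @ np.diag(W_fd[:, i]) @ W_fk for i in range(G)])
# Linear combination of slices weighted by the components of u.
T_u_full = np.tensordot(u, slices, axes=1)

# Parameter counts: full tensor versus the three factor matrices.
n_full = V * K * G
n_factored = F * (K + G + V)
```

With these toy sizes the factored form stores `F * (K + G + V)` numbers instead of `V * K * G`, and the saving grows with the vocabulary size.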

Let $E = (W^{fk})^T W^{fv}$ denote a 'folded' $K \times V$ matrix of word embeddings. Given the
context $w_1, \ldots, w_{n-1}$, the predicted next word representation $\hat{\mathbf{r}}$ is given
by:

\begin{equation}
\hat{\mathbf{r}} = \sum_{i=1}^{n-1} C^{(i)} E(:, w_i), \tag{10}
\end{equation}

where $E(:, w_i)$ denotes the column of $E$ for the word representation of $w_i$ and $C^{(i)}$,
$i = 1, \ldots, n - 1$ are $K \times K$ context matrices. Given a predicted next word
representation $\hat{\mathbf{r}}$, the factor outputs are
$\mathbf{f} = (W^{fk} \hat{\mathbf{r}}) \bullet (W^{fd} \mathbf{u})$, where $\bullet$ is a
component-wise product. The conditional probability $P(w_n = i \mid w_{1:n-1}, \mathbf{u})$ of
$w_n$ given $w_1, \ldots, w_{n-1}$ and $\mathbf{u}$ can be written as

\begin{equation}
P(w_n = i \mid w_{1:n-1}, \mathbf{u}) = \frac{\exp\big((W^{fv}(:, i))^T \mathbf{f} + b_i\big)}{\sum_{j=1}^{V} \exp\big((W^{fv}(:, j))^T \mathbf{f} + b_j\big)},
\end{equation}

where $W^{fv}(:, i)$ denotes the column of $W^{fv}$ corresponding to word $i$. In contrast to the
log-bilinear model, the matrix of word representations $R$ from before is replaced with the
factored tensor $T$ that we have derived. We compared the multiplicative model against an additive
variant [2] and found that on large datasets, such as the SBU Captioned Photo dataset [40], the
multiplicative variant significantly outperforms its additive counterpart. Thus, the SC-NLM is
derived from the multiplicative variant.

^4 Note that this is a different matrix than that used by the encoder. We use the same vocabulary
throughout both models.
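The multiplicative prediction can be sketched in NumPy as follows. This is an illustrative toy example rather than the authors' code: the function name, the random parameters and the small dimensions are assumptions.

```python
import numpy as np

def mnlm_next_word_probs(context_ids, u, W_fk, W_fd, W_fv, C, b):
    """Next-word distribution of a factored multiplicative model conditioned on u."""
    E = W_fk.T @ W_fv                                   # (K, V) folded word embeddings
    # Predicted next word representation from the context columns of E.
    r_hat = sum(C[i] @ E[:, w] for i, w in enumerate(context_ids))
    # Factor outputs: the prediction gated component-wise by the attribute vector.
    f = (W_fk @ r_hat) * (W_fd @ u)
    logits = W_fv.T @ f + b                             # one score per vocabulary word
    logits = logits - logits.max()                      # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

rng = np.random.default_rng(0)
V, K, G, F, context_size = 20, 6, 6, 4, 3
W_fk = rng.standard_normal((F, K))
W_fd = rng.standard_normal((F, G))
W_fv = rng.standard_normal((F, V))
C = rng.standard_normal((context_size, K, K))
b = np.zeros(V)
u = rng.standard_normal(G)                              # stand-in for an image or sentence embedding
probs = mnlm_next_word_probs([3, 7, 1], u, W_fk, W_fd, W_fv, C, b)
```

Computing `f` first and then multiplying by `W_fv.T` gives the same scores as forming the conditioned word matrix explicitly and applying it to `r_hat`, but never materializes a `V x K` matrix per attribute vector.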
[Figure 3 diagrams omitted. Panels: (a) Multiplicative NLM, (b) Structure-content NLM, (c) SC-NLM prediction.]

Figure 3: Left: multiplicative neural language model. Middle: Structure-content neural language
model (SC-NLM). Right: The prediction problem of an SC-NLM.



2.5 Structure-content neural language models 



We now describe the structure-content neural language model. Suppose that, along with a
description $S = \{w_1, \ldots, w_N\}$, we are also given a sequence of word-specific structure
variables $T = \{t_1, \ldots, t_N\}$. Throughout our experiments, each $t_i$ corresponds to the
part-of-speech for word $w_i$, although other possibilities can be used instead. Given an embedding
$\mathbf{u}$ (the content vector), our goal is to model the distribution
$P(w_n = i \mid w_{1:n-1}, t_{n:n+k}, \mathbf{u})$ from previous word context $w_{1:n-1}$ and
forward structure context $t_{n:n+k}$, where $k$ is the forward context size. Figure 3 gives an
illustration of the model and prediction problem. Intuitively, the structure variables help guide
the model during the generation phase and can be thought of as a soft template that helps keep the
model from generating grammatical nonsense. Note that this model shares a resemblance with the NNJM
of [41] for machine translation, where the previous word context is the predicted words in the
target language, and the forward context is words in the source language.

Our model can be interpreted as a multiplicative neural language model but where the attribute
vector is no longer $\mathbf{u}$ but instead an additive function of $\mathbf{u}$ and the structure
variables $T$. Let $\{\mathbf{t}_n, \ldots, \mathbf{t}_{n+k}\}$, $\mathbf{t}_i \in \mathbb{R}^G$,
$i = n, \ldots, n + k$ be embedding vectors for the structure variables $T$. These are obtained
from a learned lookup table in the same way as words are. We introduce a sequence of $G \times G$
structure context matrices $T^{(i)}$, $i = n, \ldots, n + k$ which play the same role as the
word context matrices $C^{(i)}$. Let $T_u$ denote a $G \times K$ context matrix for the multimodal
vector $\mathbf{u}$. The attribute vector $\hat{\mathbf{u}}$ of combined structure and content
information is computed as

\begin{equation}
\hat{\mathbf{u}} = \Big[ \sum_{i=n}^{n+k} T^{(i)} \mathbf{t}_i + T_u \mathbf{u} + \mathbf{b} \Big]_+ \tag{11}
\end{equation}

where $[\cdot]_+ = \max\{\cdot, 0\}$ is a ReLU non-linearity and $\mathbf{b}$ is a bias vector.
The vector $\hat{\mathbf{u}}$ now plays the same role as the vector $\mathbf{u}$ for the
multiplicative model previously described and the remainder of the model remains unchanged. Our
experiments use $G = K = 300$ and factors $F = 100$.
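The combined structure-content attribute vector is a ReLU applied to a sum of linear maps, which the following NumPy sketch makes explicit. The part-of-speech lookup table, tag indices, function name and toy dimensions are hypothetical and chosen only for illustration.

```python
import numpy as np

def sc_attribute_vector(t_embs, u, T_ctx, T_u, b):
    """t_embs: (k+1, G) embeddings of t_n..t_{n+k}; T_ctx: (k+1, G, G); T_u: (G, K); b: (G,)."""
    # Additive combination of the forward structure context and the content vector.
    pre = sum(T_ctx[i] @ t_embs[i] for i in range(len(t_embs))) + T_u @ u + b
    return np.maximum(pre, 0.0)  # ReLU non-linearity

rng = np.random.default_rng(0)
G, K, k = 6, 6, 2                       # toy sizes; forward context of k + 1 structure variables
pos_table = rng.standard_normal((10, G))  # hypothetical lookup table for 10 part-of-speech tags
forward_tags = [4, 1, 7]                # tag indices for t_n, ..., t_{n+k}
t_embs = pos_table[forward_tags]
T_ctx = rng.standard_normal((k + 1, G, G))
T_u = rng.standard_normal((G, K))
b = np.zeros(G)
u = rng.standard_normal(K)              # content vector from the multimodal space
u_hat = sc_attribute_vector(t_embs, u, T_ctx, T_u, b)
```

The resulting `u_hat` would then be passed to a multiplicative language model in place of the plain content vector.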

The SC-NLM is trained on a large collection of image descriptions (e.g. Flickr30K). There are
several choices available for representing the conditioning vectors $\mathbf{u}$. One choice would
be to use the embedding of the corresponding image. An alternative choice, which is the approach we
take, is to condition on the embedding vector for the description $S$ computed with the LSTM. The
advantage of this approach is that the SC-NLM can be trained purely on text alone. This allows us
to make use of large amounts of monolingual text (e.g. non-image captions) to improve the quality
of the language model. Since the embedding vectors of $S$ share a joint space with the image
embeddings, we can also condition the SC-NLM on image embeddings (e.g. at test time, when no
description is available) after the model has been trained. This is a significant advantage over a
conditional language model that explicitly requires image-caption pairs for training and highlights
the strength of a multimodal encoding space.

Due to space limitations, we leave the full details of our caption generation procedure to the
supplementary material.

Flickr8K
                                  Image Annotation                  Image Search
Model                        R@1    R@5    R@10   Med r       R@1    R@5    R@10   Med r
Random Ranking               0.1    0.6    1.1    631         0.1    0.5    1.0    500
SDT-RNN [6]                  4.5    18.0   28.6   32          6.1    18.5   29.0   29
† DeViSE [5]                 4.8    16.5   27.3   28          5.9    20.1   29.6   29
† SDT-RNN [6]                6.0    22.7   34.0   23          6.6    21.6   31.7   25
DeFrag [15]                  5.9    19.2   27.3   34          5.2    17.6   26.5   32
† DeFrag [15]                12.6   32.9   44.0   14          9.7    29.6   42.5   15
m-RNN [7]                    14.5   37.2   48.5   11          11.5   31.0   42.4   15
Our model                    13.5   36.2   45.7   13          10.4   31.0   43.7   14
Our model (OxfordNet)        18.0   40.9   55.0   8           12.5   37.0   51.5   10

Table 1: Flickr8K experiments. R@K is Recall@K (high is good). Med r is the median rank (low is good).
Best results overall are bold while best results without OxfordNet features are underlined. A † in front of the
method indicates that object detections were used along with single frame features.

3 Experiments 

3.1 Image-sentence ranking 

Our main quantitative result is to establish the effectiveness of using an LSTM sentence encoder
for ranking images and descriptions. We follow the same experimental procedure as [15]
on the Flickr8K [3] and Flickr30K [42] datasets. These datasets come with 8,000 and 30,000 images
respectively, with each image annotated using 5 sentences by independent annotators. As with [15],
we did not do any explicit text preprocessing. We used two convolutional network architectures
for extracting 4096 dimensional image features: the Toronto ConvNet^5 as well as the 19-layer
OxfordNet [43] which finished 2nd place in the ILSVRC 2014 classification competition. Following
the protocol of [15], 1000 images are used for validation, 1000 for testing and the rest are used
for training. Evaluation is performed using Recall@K, namely the mean number of images for which
the correct caption is ranked within the top-K retrieved results (and vice-versa for sentences). We
also report the median rank of the closest ground truth result from the ranked list. We compare our
results to each of the following methods:

DeViSE. The deep visual semantic embedding model [5] was proposed as a way of performing
zero-shot object recognition and was used as a baseline by [15]. In this model, sentences are
represented as the mean of their word embeddings and the objective function optimized matches ours.

SDT-RNN. The semantic dependency tree recursive neural network [6] is used to learn sentence
representations for embedding into a joint image-sentence space. The same objective is used.

DeFrag. Deep fragment embeddings [15] were proposed as an alternative to embedding full-frame
image features and take advantage of object detections from the R-CNN [44] detector. Descriptions
are represented as a bag of dependency parses. Their objective incorporates both global and
fragment objectives, for which their global objective matches ours.

m-RNN. The multimodal recurrent neural network [7] is a recently proposed method that uses
perplexity as a bridge between modalities, as first introduced by [2]. Unlike all other methods,
the m-RNN does not use a ranking loss and instead optimizes the log-likelihood of predicting the
next word in a sequence conditioned on an image.

Our LSTMs use 1 layer with 300 units and weights initialized uniformly from [-0.08, 0.08]. The
margin $\alpha$ was set to $\alpha = 0.2$, which we found performed well on both datasets. Training
is done using stochastic gradient descent with an initial learning rate of 1 that was exponentially
decreased. We used minibatch sizes of 40 on Flickr8K and 100 on Flickr30K. No momentum was used.
The same hyperparameters are used for the OxfordNet experiments.
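The evaluation protocol described above (Recall@K and the median rank of the closest ground truth result) can be sketched for the image annotation direction as follows. This is a minimal NumPy sketch, not the evaluation script used for the reported numbers; the function name, the `sent_to_image` index array and the toy embeddings are assumptions. Recall is reported as a percentage, as in the results tables.

```python
import numpy as np

def annotation_metrics(image_embs, sent_embs, sent_to_image, ks=(1, 5, 10)):
    """Rank all sentences for each image; sent_to_image[j] is the image index of sentence j."""
    sent_to_image = np.asarray(sent_to_image)
    scores = image_embs @ sent_embs.T            # (n_images, n_sentences) similarity matrix
    order = np.argsort(-scores, axis=1)          # sentence indices, best match first
    ranks = np.empty(len(image_embs), dtype=int)
    for i in range(len(image_embs)):
        is_ground_truth = sent_to_image[order[i]] == i
        # 1-based rank of the closest (highest scoring) ground truth sentence.
        ranks[i] = np.flatnonzero(is_ground_truth)[0] + 1
    recall = {k: 100.0 * np.mean(ranks <= k) for k in ks}
    return recall, float(np.median(ranks))

# Toy data: 3 images, 2 sentences each; image 2 has two distractors ranked above its captions.
image_embs = np.eye(3)
sent_embs = np.array([
    [1.0, 0.0, 0.5], [1.0, 0.0, 0.0],   # sentences for image 0
    [0.0, 1.0, 0.5], [0.0, 1.0, 0.0],   # sentences for image 1
    [0.0, 0.0, 0.3], [0.0, 0.0, 0.2],   # sentences for image 2
])
recall, median_rank = annotation_metrics(image_embs, sent_embs, [0, 0, 1, 1, 2, 2])
```

Image search is evaluated the same way with the roles of images and sentences swapped, i.e. by ranking images for each sentence using the transposed score matrix.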

3.1.1 Results 

Tables 1 and 2 illustrate our results on Flickr8K and Flickr30K respectively. The performance of
our model is comparable to that of the m-RNN. For some metrics we outperform or match existing
results while on others m-RNN outperforms our model. The m-RNN does not learn an explicit
embedding between images and sentences and relies on perplexity as a means of retrieval. Methods that



^5 https://github.com/TorontoDeepLearning/convnet

Flickr30K
                                  Image Annotation                  Image Search
Model                        R@1    R@5    R@10   Med r       R@1    R@5    R@10   Med r
Random Ranking               0.1    0.6    1.1    631         0.1    0.5    1.0    500
† DeViSE [5]                 4.5    18.1   29.2   26          6.7    21.9   32.7   25
† SDT-RNN [6]                9.6    29.8   41.1   16          8.9    29.8   41.1   16
† DeFrag [15]                14.2   37.7   51.3   10          10.2   30.8   44.2   14
† DeFrag + Finetune CNN [15] 16.4   40.2   54.7   8           10.3   31.4   44.5   13
m-RNN [7]                    18.4   40.2   50.9   10          12.6   31.2   41.5   16
Our model                    14.8   39.2   50.9   10          11.8   34.0   46.3   13
Our model (OxfordNet)        23.0   50.7   62.9   5           16.8   42.0   56.5   8

Table 2: Flickr30K experiments. R@K is Recall@K (high is good). Med r is the median rank (low is good).
Best results overall are bold while best results without OxfordNet features are underlined. A † in front of the
method indicates that object detections were used along with single frame features.



learn explicit embedding spaces have a significant speed advantage over perplexity-based retrieval
methods, since retrieval is easily done with a single matrix multiply of stored embedding vectors
from the dataset with the query vector. Thus explicit embedding methods are much better suited for
scaling to large datasets.

Perhaps more interesting is the fact that both our method and the m-RNN outperform existing
models that integrate object detections. This is contradictory to [6], where recurrent networks are
the worst performing models. This highlights the effectiveness of LSTM cells for encoding
dependencies across descriptions and learning meaningful distributed sentence representations.
Integrating object detections into our framework should almost surely improve performance as well
as allow for interpretable retrievals, as in the case of DeFrag.

Using image features from the OxfordNet model results in a significant performance boost across 
all metrics, giving new state-of-the-art numbers on these evaluation tasks. 



3.2 Multimodal linguistic regularities 

Word embeddings learned with skip-gram [37] or neural language models [45] were shown by [12]
to exhibit linguistic regularities that allow these models to perform analogical reasoning. For
instance, "man" is to "woman" as "king" is to ? can be answered by finding the closest vector to
"king" - "man" + "woman". A natural question we ask is whether multimodal vector spaces exhibit
the same phenomenon. Would *image of a blue car* - "blue" + "red" be near images of red cars?

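Suppose that we train an embedding model with a linear encoder, namely
$\mathbf{v} = \sum_{i=1}^{N} \mathbf{w}_i$ for word vectors $\mathbf{w}_i$ and sentence vector
$\mathbf{v}$ (where both $\mathbf{v}$ and the image embedding are normalized to unit length). Using
our example above, let $\mathbf{v}_{blue}$, $\mathbf{v}_{red}$ and $\mathbf{v}_{car}$ denote the
word embeddings for blue, red and car respectively. Let $\mathbf{I}_{bcar}$ and $\mathbf{I}_{rcar}$
denote embeddings of images with blue and red cars. After training a linear encoder, the model has
the property that $\mathbf{v}_{blue} + \mathbf{v}_{car} \approx \mathbf{I}_{bcar}$ and
$\mathbf{v}_{red} + \mathbf{v}_{car} \approx \mathbf{I}_{rcar}$. It follows that

\begin{align}
\mathbf{v}_{car} &\approx \mathbf{I}_{bcar} - \mathbf{v}_{blue} \tag{12} \\
\mathbf{v}_{red} + \mathbf{v}_{car} &\approx \mathbf{I}_{bcar} - \mathbf{v}_{blue} + \mathbf{v}_{red} \tag{13} \\
\mathbf{I}_{rcar} &\approx \mathbf{I}_{bcar} - \mathbf{v}_{blue} + \mathbf{v}_{red} \tag{14}
\end{align}

Thus given a query image $\mathbf{q}$, a negative word $\mathbf{w}_n$ and a positive word
$\mathbf{w}_p$ (all with unit norm), we seek an image $\mathbf{x}^*$ such that:

\begin{equation}
\mathbf{x}^* = \arg\max_{\mathbf{x}} \frac{(\mathbf{q} - \mathbf{w}_n + \mathbf{w}_p)^T \mathbf{x}}{\|\mathbf{q} - \mathbf{w}_n + \mathbf{w}_p\|} \tag{15}
\end{equation}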

The supplementary material contains qualitative evidence that the above holds for several types
of regularities and images.^6 In our examples, we consider retrieving the top-4 nearest images.
Occasionally we observed that a poor result would be obtained within the top-4 among good results.
We found that a simple strategy for removing these cases is to first retrieve the top N nearest
images, then re-sort these based on their distance to the mean of the N images.
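The retrieval procedure just described, including the re-sorting step, can be sketched in NumPy as below. The function names and the three-dimensional toy embeddings (orthogonal unit vectors standing in for "blue", "red" and "car") are assumptions made purely for illustration.

```python
import numpy as np

def unit(a):
    """Scale a vector, or each row of a matrix, to unit norm."""
    return a / np.linalg.norm(a, axis=-1, keepdims=True)

def arithmetic_query(q, w_neg, w_pos, image_embs, top_n=4, resort=True):
    """Return indices of images nearest to q - w_neg + w_pos (image_embs rows are unit norm)."""
    target = unit(q - w_neg + w_pos)
    scores = image_embs @ target                # cosine similarity to the query vector
    top = np.argsort(-scores)[:top_n]           # top-N nearest images
    if resort:
        # Re-sort the retrieved set by distance to its own mean to push outliers down.
        mean = image_embs[top].mean(axis=0)
        dists = np.linalg.norm(image_embs[top] - mean, axis=1)
        top = top[np.argsort(dists)]
    return top

# Toy embedding space: words are orthogonal unit vectors, images are sums of their words.
blue, red, car = np.eye(3)
image_embs = unit(np.stack([blue + car, red + car, blue]))  # blue car, red car, something blue
q = image_embs[0]                                           # query: image of a blue car
nearest = arithmetic_query(q, blue, red, image_embs, top_n=1)
```

In this toy space the nearest image to *blue car* - "blue" + "red" is the red car (index 1), mirroring the relation derived above for a linear encoder.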

It is worth noting that these kinds of regularities are not well observed with an LSTM encoder,
since sentences are no longer just a sum of their words. The linear encoder is roughly equivalent
to the DeViSE baselines in Tables 1 and 2, which perform significantly worse for retrieval than an
LSTM encoder. So while these regularities are interesting, the learned multimodal vector space is
not well suited for ranking sentences and images.

^6 For this model we finetune the word representations.

3.3 Image caption generation 

We generated image descriptions for roughly 800 images from the SBU captioned photo dataset [40].
These are the same images used to display results by the current state-of-the-art composition-based
approach, TreeTalk [27]. Our LSTM encoder and SC-NLM decoder were trained by concatenating
the Flickr30K dataset with the recently released Microsoft COCO dataset [46], which combined
give us over 100,000 images and over 500,000 descriptions for training. The SBU dataset contains 1
million images each with a single description and was used by [27] for training their model. While
the SBU dataset is larger, the annotated descriptions are noisier and more personalized.

The generated results can be found at http://www.cs.toronto.edu/~rkiros/lstm_scnlm.html
For each image we show the original caption, the nearest neighbour sentence
from the training set, the top-5 generated samples from our model and the best generated result from
TreeTalk. The nearest neighbour sentence is displayed to demonstrate that our model has not simply
learned to copy the training data. Our generated descriptions are arguably the nicest ones to date.

4 Discussion 

When generating a description, it is often the case that only a small region is relevant at any given 
time. We are developing an attention-based model that jointly learns to align parts of captions to 
images and use these alignments to determine where to attend next, thus dynamically modifying the 
vectors used for conditioning the decoder. We also plan on experimenting with LSTM decoders as 
well as deep and bidirectional LSTM encoders. 

5 Supplementary material: Additional experimentation and details

5.1 Multimodal linguistic regularities



[Figure 4 panels: (a) Simple cases, (b) Colors, (c) Image structure, (d) Sanity check. Each panel shows the nearest images.]

Figure 4: Multimodal vector space arithmetic. Query images were downloaded online and retrieved
images are from the SBU dataset.

[Figure 5 panels: (a) Colors, (b) Weather.]

Figure 5: PCA projection of the 300-dimensional word and image representations for (a) cars and
colors and (b) weather and temperature.

Figure 4 illustrates sample results using a model trained on the SBU dataset. All queries were
downloaded online and retrieved images are from the SBU images used for training. What is of 
interest to note is that the resulting images depend highly on the image used for the query. For 
example, searching for the word 'night' retrieves arbitrary images taken at night. On the other 
hand, an image with a building predominantly as its focus will return night images when 'day' is 
subtracted and 'night' is added. A similar phenomenon occurs with the example of cats, bowls and
boxes. As additional visualizations, we computed PCA projections of cars and their corresponding
colors as well as images and the weather occurrences in Figure 5. These results give us strong
evidence for the regularities apparent in multimodal vector spaces trained with linear encoders. Of
course, sensible results are only likely to be obtained if (a) the content of the image is correctly 
recognized, (b) the subtraction word is relevant to the image and (c) an image exists that is sensible 
for the corresponding query. 
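
For readers who wish to produce similar visualizations, a PCA projection of jointly embedded words and images can be sketched as follows. This is an illustrative sketch only: the function name `pca_project` is hypothetical, and we assume the word and image embeddings have been stacked row-wise into a single array so that both are projected onto the same axes.

```python
import numpy as np

def pca_project(embeddings, n_components=2):
    """Project the rows of `embeddings` onto their top principal components.

    embeddings : array of shape (num_points, d), e.g. d = 300
    Returns an array of shape (num_points, n_components).
    """
    # PCA requires mean-centred data.
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T
```

Plotting the two returned columns against each other, with words and images drawn in different markers, gives a 2-D view of the kind summarized in Figure 5.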

5.2 Image description generation 

The SC-NLM was trained on the concatenation of training sentences from both Flickr30K and
Microsoft COCO. Given an image, we first map it into the multimodal space. From this embedding,
we define 2 sets of candidate conditioning vectors to the SC-NLM:

Image embedding. The embedded image itself. Note that the SC-NLM was not trained with images 
but can be conditioned on images since the embedding space is multimodal. 

Top-N nearest words and sentences. After first computing the image embedding, we obtain the
top-N nearest neighbour words and training sentences using cosine similarity. These retrievals are
treated as a 'bag of concepts' for which we compute an embedding vector as the mean of each
concept. All of our results use N = 5.
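
The second kind of conditioning vector can be illustrated with a short NumPy sketch. It rests on one reading of the description above, namely that the retrieved word and sentence embeddings are pooled into a single bag and averaged; this pooling, the function names and the array layout (one embedding per row) are assumptions made for illustration.

```python
import numpy as np

def top_n_cosine(query, candidates, n):
    """Indices of the n rows of `candidates` most cosine-similar to `query`."""
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return np.argsort(-(c @ q))[:n]

def bag_of_concepts_vector(image_emb, word_embs, sentence_embs, n):
    """Mean embedding of the image's n nearest words and n nearest training sentences."""
    words = word_embs[top_n_cosine(image_emb, word_embs, n)]
    sents = sentence_embs[top_n_cosine(image_emb, sentence_embs, n)]
    # Pool both retrieval sets into one 'bag of concepts' and average (assumed).
    return np.vstack([words, sents]).mean(axis=0)
```

Because words, sentences and images share one embedding space, the averaged vector can be passed to the decoder in the same way as the image embedding itself.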

Along with the candidate conditioning vectors, we also compute candidate POS sequences used by 
the SC-NLM. For this, we obtain a set of all POS sequences from the training set whose lengths 
were between 4 and 12, inclusive. Captions are generated by first sampling a conditioning vector, 
next sampling a POS sequence, then computing a MAP estimate from the SC-NLM. We generate 
a large list of candidate descriptions (1000 for each image in our results) and rank these candidates 
using a scoring function. Our scoring function consists of two feature functions: 

Translation model. The candidate description is embedded into the multimodal space using the 
LSTM. We then compute a translation score as the cosine similarity between the image embedding 
and the embedding of the candidate description. This scores how relevant the content of the candidate
is to the image. We also apply to this score a multiplicative penalty for non-stopwords that
appear too frequently in the description.^

Language model. We trained a Kneser-Ney trigram model on a large corpus and compute the
log-probability of the candidate under the model. This scores how reasonable the candidate is as
an English sentence.

The total score of a caption is then the weighted sum of the translation and language models. Due to 
the challenge of quantitatively evaluating generated descriptions, we tuned the weights by hand on 
qualitative results alone. All of the candidate descriptions are ranked by their scores, and the top-5 
captions are returned. 
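
The ranking step can be sketched as below. This is a hedged illustration rather than the scoring code used for our results: the exact form of the repetition penalty, the stopword list, the clipping of negative cosine similarities and the default weights are all assumptions, and the caption embeddings and language-model log-probabilities are taken as precomputed inputs (from the LSTM encoder and the trigram model respectively).

```python
import numpy as np
from collections import Counter

# Illustrative stopword list only; any standard list could be substituted.
STOPWORDS = frozenset({"a", "an", "the", "of", "in", "on", "and", "is", "with"})

def repetition_penalty(tokens, max_count=1, factor=0.5):
    """Assumed penalty: multiply by `factor` for every repeat of a non-stopword."""
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    excess = sum(max(0, c - max_count) for c in counts.values())
    return factor ** excess

def translation_score(image_emb, caption_emb, tokens):
    """Cosine similarity between image and caption, scaled by the penalty."""
    cos = float(image_emb @ caption_emb /
                (np.linalg.norm(image_emb) * np.linalg.norm(caption_emb)))
    # Clip at zero so the multiplicative penalty can never raise a score (assumed).
    return max(cos, 0.0) * repetition_penalty(tokens)

def rank_candidates(image_emb, candidates, w_trans=1.0, w_lm=0.1, top_k=5):
    """Rank (tokens, caption_emb, lm_logprob) triples by the weighted total score."""
    scored = []
    for tokens, emb, lm_logprob in candidates:
        total = w_trans * translation_score(image_emb, emb, tokens) + w_lm * lm_logprob
        scored.append((total, tokens))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [tokens for _, tokens in scored[:top_k]]
```

With this form of penalty, a candidate that repeats a content word several times is scaled down even when its embedding lies close to the image, which is the behaviour the penalty is meant to enforce.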



^For instance, given an image of a car, we would want a candidate to be ranked low if each noun in the 
description was 'car'. 